Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning
An oft-cited challenge of federated learning is the presence of
heterogeneity. \emph{Data heterogeneity} refers to the fact that data from
different clients may follow very different distributions. \emph{System
heterogeneity} refers to client devices having different system capabilities. A
considerable number of federated optimization methods address this challenge.
In the literature, empirical evaluations usually start federated training from
random initialization. However, in many practical applications of federated
learning, the server has access to proxy data for the training task that can be
used to pre-train a model before starting federated training. Using four
standard federated learning benchmark datasets, we empirically study the impact
of starting from a pre-trained model in federated learning. Unsurprisingly,
starting from a pre-trained model reduces the training time required to reach a
target error rate and enables the training of more accurate models (up to 40\%)
than is possible when starting from random initialization. Surprisingly, we
also find that starting federated learning from a pre-trained initialization
reduces the effect of both data and system heterogeneity. We recommend future
work proposing and evaluating federated optimization methods to evaluate the
performance when starting from random and pre-trained initializations. This
study raises several questions for further work on understanding the role of
heterogeneity in federated optimization. \footnote{Our code is available at:
\url{https://github.com/facebookresearch/where_to_begin}}
Comment: Accepted at ICLR
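As a concrete illustration of the experimental contrast, the sketch below runs federated averaging (FedAvg) from both a random and a pre-trained initialization on synthetic heterogeneous clients. The linear model, client data, and server-side proxy pre-training are placeholder assumptions, not the paper's benchmarks or training setup.

# Minimal FedAvg sketch contrasting random vs. pre-trained initialization.
# The linear model, synthetic clients, and "proxy" pre-training below are
# illustrative placeholders, not the benchmarks or models used in the paper.
import numpy as np

rng = np.random.default_rng(0)
DIM, CLIENTS, ROUNDS, LOCAL_STEPS, LR = 20, 10, 50, 5, 0.1
w_true = rng.normal(size=DIM)

def make_client(shift):
    # Heterogeneous clients: each sees a shifted input distribution.
    X = rng.normal(loc=shift, size=(100, DIM))
    y = X @ w_true + 0.1 * rng.normal(size=100)
    return X, y

clients = [make_client(shift) for shift in rng.normal(scale=2.0, size=CLIENTS)]

def local_sgd(w, X, y):
    for _ in range(LOCAL_STEPS):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - LR * grad / (1.0 + np.linalg.norm(grad))  # normalized step avoids divergence
    return w

def fedavg(w0):
    w = w0.copy()
    for _ in range(ROUNDS):
        # One round: every client runs local SGD from the shared model,
        # and the server averages the resulting client models.
        w = np.mean([local_sgd(w, X, y) for X, y in clients], axis=0)
    return w

def global_loss(w):
    return float(np.mean([np.mean((X @ w - y) ** 2) for X, y in clients]))

w_random = fedavg(rng.normal(size=DIM))          # random initialization

X_proxy = rng.normal(size=(500, DIM))            # server-side proxy data (assumed)
y_proxy = X_proxy @ w_true + 0.5 * rng.normal(size=500)
w_pre, _, _, _ = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)
w_pretrained = fedavg(w_pre)                     # pre-trained initialization

print("final loss, random init:     ", global_loss(w_random))
print("final loss, pre-trained init:", global_loss(w_pretrained))

In this toy setting the pre-trained start typically begins closer to the shared optimum, mirroring the paper's observation that pre-training reduces the rounds needed to reach a target error rate.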
PaCo: Probability-based Path Confidence Prediction
Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Science Foundation / CCR-042971. Gigascale Systems Research Center.
Effective Long-Context Scaling of Foundation Models
We present a series of long-context LLMs that support effective context
windows of up to 32,768 tokens. Our model series are built through continual
pretraining from Llama 2 with longer training sequences and on a dataset where
long texts are upsampled. We perform extensive evaluation on language modeling,
synthetic context probing tasks, and a wide range of research benchmarks. On
research benchmarks, our models achieve consistent improvements on most regular
tasks and significant improvements on long-context tasks over Llama 2. Notably,
with a cost-effective instruction tuning procedure that does not require
human-annotated long instruction data, the 70B variant can already surpass
gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks.
Alongside these results, we provide an in-depth analysis on the individual
components of our method. We delve into Llama's position encoding and discuss
its limitations in modeling long dependencies. We also examine the impact of
various design choices in the pretraining process, including the data mix and
the training curriculum of sequence lengths. Our ablation experiments suggest
that having abundant long texts in the pretraining dataset is not the key to
achieving strong performance, and we empirically verify that long-context
continual pretraining is more efficient and similarly effective compared to
pretraining from scratch with long sequences.
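The position-encoding discussion can be made concrete with a short sketch. The abstract does not spell out the modification used, so the code below assumes the commonly used remedy of raising the rotary position embedding (RoPE) base frequency so that rotation frequencies decay more slowly over long contexts; the larger base value and the dimensions are illustrative, not the paper's configuration.

# Minimal sketch of rotary position embeddings (RoPE) with an adjustable base.
# The 500,000 base is an illustrative "long-context" choice, not necessarily
# the value used in the paper; 10,000 is the standard RoPE default.
import numpy as np

def rope_angles(seq_len, head_dim, base=10_000.0):
    # angle[p, i] = p * base^(-2i / head_dim), one angle per dimension pair
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    positions = np.arange(seq_len)
    return np.outer(positions, inv_freq)          # shape (seq_len, head_dim // 2)

def apply_rope(x, base=10_000.0):
    # Rotate query/key vectors x of shape (seq_len, head_dim) by position.
    seq_len, head_dim = x.shape
    ang = rope_angles(seq_len, head_dim, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]               # pair up even/odd dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(0).normal(size=(4096, 128))
q_default = apply_rope(q, base=10_000.0)          # standard base
q_long = apply_rope(q, base=500_000.0)            # larger base (illustrative)

A larger base slows the per-dimension rotation, so very distant positions still map to distinct relative angles instead of wrapping around many times within the context window.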
Critical Branches and Lucky Loads in Control-Independence Architectures
148 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2009.
I perform a thorough analysis of the performance sensitivity of CI processors to disambiguation and forwarding. The insights from this analysis are used to drive the design of hardware mechanisms for these two functions that are low in complexity and yet attain high performance. The basic premise behind these mechanisms is to use small caches to perform early disambiguation and forwarding. These caches are not responsible for ensuring correctness; they merely enable high performance in the presence of lucky loads. The caches are backed by a simple load re-execution mechanism that guarantees correctness. I find that the performance of a CI processor with small structures for disambiguation and forwarding (32-entry and 128-entry, respectively) is within 10% of global load and store queues in the worst case.
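To illustrate the division of labor described above, the sketch below models a tiny forwarding cache that supplies early (possibly wrong) load values, with load re-execution against memory acting as the correctness backstop. The cache size, indexing scheme, and flat memory model are hypothetical simplifications, not the thesis's hardware design.

# Minimal software model of a small forwarding cache backed by load
# re-execution. Sizes, direct-mapped indexing, and the flat memory model
# are hypothetical simplifications, not the thesis's hardware design.
class ForwardingCache:
    """Tiny direct-mapped cache used only for early value prediction.
    Aliasing can silently return a value recorded for a different address;
    that is acceptable because every load is later re-executed and verified."""

    def __init__(self, entries=32):
        self.entries = entries
        self.values = [None] * entries

    def record_store(self, address, value):
        self.values[address % self.entries] = value

    def predict_load(self, address):
        return self.values[address % self.entries]

def run(trace, cache_entries=4):
    memory, cache, squashes = {}, ForwardingCache(cache_entries), 0
    for op, address, value in trace:
        if op == "store":
            cache.record_store(address, value)
            memory[address] = value            # architectural state
        else:  # load: execute early from the small cache, then verify
            early = cache.predict_load(address)
            correct = memory.get(address)      # load re-execution (ground truth)
            if early != correct:
                squashes += 1                  # wrong early value: squash and recover
    return squashes

# Hypothetical trace: 0x10 and 0x14 alias in a 4-entry cache, so the load to
# 0x10 receives the value stored to 0x14 early and must be squashed.
trace = [("store", 0x10, 1), ("store", 0x14, 2),
         ("load", 0x10, None), ("load", 0x14, None)]
print("squashes:", run(trace))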